Stochastic Optimization with Laggard Data Pipelines
State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of data-echoed extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
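The data-echoing scheme described above admits a very small sketch: instead of taking one gradient step per minibatch, the optimizer reuses each batch for several steps while the upstream pipeline produces the next one. The snippet below is an illustrative sketch only, assuming plain SGD; the names (`echoed_sgd`, `echoes`, `grad`) are hypothetical and not from the paper.

```python
def echoed_sgd(w, batches, grad, lr=0.1, echoes=4):
    """SGD with data echoing: each batch is reused for `echoes` steps.

    w       -- initial parameter (scalar here, for simplicity)
    batches -- iterable of minibatches arriving from a slow pipeline
    grad    -- grad(w, batch) returns the stochastic gradient on that batch
    echoes  -- number of repeated steps per batch (echoes=1 is plain SGD)
    """
    for batch in batches:            # fresh batches arrive slowly upstream
        for _ in range(echoes):      # take repeated steps while waiting
            w = w - lr * grad(w, batch)
    return w


# Toy usage: minimize the squared distance to the batch mean in 1-D.
# The gradient of (w - mean(batch))^2 is 2 * (w - mean(batch)).
def sq_grad(w, batch):
    mean = sum(batch) / len(batch)
    return 2.0 * (w - mean)


w_final = echoed_sgd(0.0, [[1.0, 3.0]] * 10, sq_grad, lr=0.1, echoes=4)
# With 10 batches echoed 4 times each (40 total steps), w_final is
# close to the batch mean of 2.0.
```

With `echoes=1` this reduces to synchronous minibatch SGD, which is the counterpart the paper's analysis compares against.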
Review for NeurIPS paper: Stochastic Optimization with Laggard Data Pipelines
Clarity: The paper's writing is very good, but I found several small notational problems that could cause confusion: - Between lines 108-109, the authors use both \bf\xi with a superscript "t" and \bf\xi without one. I guess the latter denotes a general batch of samples that does not depend on "t", but this is not explained clearly. Also, "i" sometimes appears in the superscript and at other times in the subscript. The reuse of the same notation confused me for a while, since it looks like \xi is an element belonging to \bf\xi or \bf\xi'. Is this a proof artifact? If the goal is averaging here, it would make more sense for the w_t's to have different weights, so that more recent updates receive higher weight.
Review for NeurIPS paper: Stochastic Optimization with Laggard Data Pipelines
The paper is a theoretical analysis of the behaviour of "echoed gradients" in convex optimization. The investigation is timely, and will cast light on an interesting area of current practice. More than one reviewer believes the paper should explicitly handle the non-convex case. I disagree, and side with the authors that the convex case is sufficient. The relevant non-convex optimizers generally contain convex stepping as a subprogram, so this analysis is reasonable.